Improve plots and fix exploratory notebook issues #596
e8d33a3 to 6705027
Pull request overview
This PR makes a set of fixes and improvements to exploratory Jupyter notebooks and the related chart-generation Python scripts. It adds nbformat so Plotly can render figures in notebooks, repoints notebooks at their new queries/ folders (and ../../../cypher/... for shared Cypher), adds a new cyclomatic-complexity distribution chart + CSV report for the Java domain, drops the no-longer-needed openTSNE dimensionality reduction in favour of UMAP, and introduces a CI workflow that smoke-executes every explore/*.ipynb to catch import/syntax regressions.
Changes:
- Notebook hygiene: type annotations, path corrections, removal of dead helpers (get_plotly_figure_write_image_settings, t-SNE path), stringification of date columns for Plotly JSON serialization, NEO4J_INITIAL_PASSWORD validation.
- New chart pipeline: Cyclomatic_Method_Complexity_Distribution.cypher + CSV export + normalized per-artifact line chart in javaCharts.py; pie charts in externalDependencyCharts.py now plot the full dataset and rely on threshold filtering instead of pre-head(20).
- New internal-check-notebooks.yml workflow that executes notebooks with nbconvert --allow-errors and fails only on ModuleNotFoundError/ImportError/SyntaxError.
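
The "fail only on import/syntax errors" rule can be illustrated with a short Python sketch. This is not the workflow's actual implementation, which is not shown in this conversation. It assumes the notebook was already executed with error outputs retained, and it relies only on the nbformat v4 JSON layout, where an error output carries `output_type: "error"` and an `ename` field. The function name `find_fatal_errors` and the sample notebook are hypothetical.

```python
# Exception names treated as fatal, as listed in the PR overview.
FATAL_ERROR_NAMES = {"ModuleNotFoundError", "ImportError", "SyntaxError"}


def find_fatal_errors(notebook: dict) -> list[tuple[int, str]]:
    """Return (cell_index, error_name) for each fatal error output.

    `notebook` is the parsed JSON of an executed .ipynb file (nbformat v4).
    Errors with any other name (e.g. a missing database) are ignored.
    """
    fatal = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            is_error = output.get("output_type") == "error"
            if is_error and output.get("ename") in FATAL_ERROR_NAMES:
                fatal.append((index, output["ename"]))
    return fatal


# Hypothetical executed notebook: one fatal and one tolerated error.
executed_notebook = {
    "nbformat": 4,
    "cells": [
        {
            "cell_type": "code",
            "source": "import not_installed_package",
            "outputs": [
                {
                    "output_type": "error",
                    "ename": "ModuleNotFoundError",
                    "evalue": "No module named 'not_installed_package'",
                    "traceback": [],
                }
            ],
        },
        {
            "cell_type": "code",
            "source": "raise ValueError('no database available')",
            "outputs": [
                {
                    "output_type": "error",
                    "ename": "ValueError",
                    "evalue": "no database available",
                    "traceback": [],
                }
            ],
        },
        {"cell_type": "markdown", "source": "# Notes"},
    ],
}

fatal_errors = find_fatal_errors(executed_notebook)
```

In a CI step, a non-empty `fatal_errors` list would translate into a non-zero exit code, while runtime failures such as an unreachable Neo4j instance would pass the smoke test.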
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| pyproject.toml | Add nbformat==5.10.4 (Plotly notebook rendering). |
| conda-environment.yml | Mirror the new nbformat=5.10.4 pin. |
| uv.lock | Lockfile updates for nbformat and transitive deps. |
| domains/java/queries/method-metrics/Cyclomatic_Method_Complexity_Distribution.cypher | New query (description has a duplicated word; uses lowercase asc). |
| domains/java/javaCsv.sh | Export the new cyclomatic-complexity CSV. |
| domains/java/javaCharts.py | Refactor line-count chart to normalized line chart; add cyclomatic chart. |
| domains/java/explore/MethodMetricsJavaExploration.ipynb | Notebook fixes & path update for the moved cypher file. |
| domains/external-dependencies/externalDependencyCharts.py | Drop unnecessary head(20) calls; widen < to <= in drill-down filter (inconsistent with grouping). |
| domains/external-dependencies/explore/ExternalDependenciesJava.ipynb | Type annotations, password validation, path updates, bug-fix variable rename in drill-down cell. |
| domains/external-dependencies/explore/ExternalDependenciesTypescript.ipynb | Same modernization & path updates as the Java notebook. |
| domains/git-history/explore/GitHistoryGeneralExploration.ipynb | Cast Plotly date columns to strings; remove SVG write helper; metadata Python version downgraded (3.12.8) and diverges from conda env. |
| domains/node-embeddings/explore/NodeEmbeddingsJavaExploration.ipynb | Path updates and type-ignore pragmas. |
| domains/node-embeddings/explore/NodeEmbeddingsTypescriptExploration.ipynb | Same path/pragma updates. |
| domains/anomaly-detection/explore/NodeEmbeddingsHyperparameterTuningExploration.ipynb | Replace t-SNE with UMAP, add copy=True to HDBSCAN, cypher path updates; stray execution_count: 43. |
| .github/workflows/internal-check-notebooks.yml | New matrix workflow that smoke-tests all explore notebooks. |
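
The `<` versus `<=` remark on externalDependencyCharts.py is easiest to see with a boundary value. The sketch below is a hypothetical reconstruction, not the project's code: it assumes the grouping step keeps a slice in the main pie when its share is at least the threshold, and that the drill-down chart is meant to show exactly the slices that were folded into "others". All names and numbers are made up for illustration.

```python
def share_percent(value: float, total: float) -> float:
    """Share of `value` in `total`, in percent."""
    return value * 100 / total


def main_slices(values: dict[str, float], threshold_percent: float) -> list[str]:
    """Assumed grouping rule: keep slices whose share is >= threshold."""
    total = sum(values.values())
    return [n for n, v in values.items() if share_percent(v, total) >= threshold_percent]


def drill_down_slices(
    values: dict[str, float], threshold_percent: float, inclusive: bool
) -> list[str]:
    """Drill-down filter using `<=` (inclusive) or `<` (strict)."""
    total = sum(values.values())
    if inclusive:
        return [n for n, v in values.items() if share_percent(v, total) <= threshold_percent]
    return [n for n, v in values.items() if share_percent(v, total) < threshold_percent]


# Hypothetical caller counts; "b_lib" sits exactly on the 0.5% threshold.
example_values = {"a_lib": 1950, "b_lib": 10, "c_lib": 36, "d_lib": 4}
threshold = 0.5

in_main = main_slices(example_values, threshold)
strict = drill_down_slices(example_values, threshold, inclusive=False)
inclusive = drill_down_slices(example_values, threshold, inclusive=True)

# Slices that would be drawn in both the main pie and the drill-down pie.
duplicated = sorted(set(in_main) & set(inclusive))
```

Under these assumptions the strict filter partitions the data cleanly, while the inclusive filter draws `b_lib` in both charts. If the real grouping helper uses the opposite boundary convention, the same check would flag the strict variant instead.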
6705027 to ddf3dea
5a0913f to 5a0c57f
Pull request overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
domains/external-dependencies/externalDependencyCharts.py:535
- Removing the top20 = ...head(20) pre-slicing here (and in all the other save_pie_chart_pair callsites below in this function and in generate_typescript_charts) means save_pie_chart_pair now receives the full overall/spread datasets, which can contain hundreds of rows. The chart relies entirely on group_small_values_into_others's percentage threshold to keep the pie readable. Please verify on a large project (e.g. AxonFramework) that the resulting pie charts and legends remain legible: without the explicit top-20 cap, charts with many low-but-not-tiny-percentage slices can become unreadable. If acceptable visually, consider documenting that the "Top N" naming in chart_name_prefix (Java_Top_external_packages_by_types, etc.) no longer reflects a strict top-N selection.
    # ── Top external packages (Table 1 equivalent) ────────────────────────────
    if not overall_data.empty:
        save_pie_chart_pair(
            source_data=overall_data,
            value_column="numberOfExternalCallerTypes",
            name_column="externalPackageName",
            chart_name_prefix="Java_Top_external_packages_by_types",
            primary_threshold_percent=0.7,
            report_directory=report_directory,
            verbose=verbose,
        )
        save_pie_chart_pair(
            source_data=overall_data,
            value_column="numberOfExternalCallerPackages",
            name_column="externalPackageName",
            chart_name_prefix="Java_Top_external_packages_by_packages",
            primary_threshold_percent=0.7,
            report_directory=report_directory,
            verbose=verbose,
        )
    # ── Second-level package grouping (Table 2 equivalent) ────────────────────
    if not second_level_overall_data.empty:
        save_pie_chart_pair(
            source_data=second_level_overall_data,
            value_column="numberOfExternalCallerTypes",
            name_column="externalSecondLevelPackageName",
            chart_name_prefix="Java_Top_second_level_packages_by_types",
            primary_threshold_percent=0.7,
            report_directory=report_directory,
            verbose=verbose,
        )
        save_pie_chart_pair(
            source_data=second_level_overall_data,
            value_column="numberOfExternalCallerPackages",
            name_column="externalSecondLevelPackageName",
            chart_name_prefix="Java_Top_second_level_packages_by_packages",
            primary_threshold_percent=0.7,
            report_directory=report_directory,
            verbose=verbose,
        )
    # ── Most spread external packages (Table 3 equivalent) ────────────────────
    if not spread_data.empty:
        save_pie_chart_pair(
            source_data=spread_data,
            value_column="sumNumberOfTypes",
            name_column="externalPackageName",
            chart_name_prefix="Java_Most_spread_packages_by_types",
            primary_threshold_percent=0.5,
            report_directory=report_directory,
            verbose=verbose,
        )
        save_pie_chart_pair(
            source_data=spread_data,
            value_column="sumNumberOfPackages",
            name_column="externalPackageName",
            chart_name_prefix="Java_Most_spread_packages_by_packages",
            primary_threshold_percent=0.5,
            report_directory=report_directory,
            verbose=verbose,
        )
    # ── Most spread second-level packages (Table 4 equivalent) ────────────────
    if not second_level_spread_data.empty:
        save_pie_chart_pair(
            source_data=second_level_spread_data,
            value_column="sumNumberOfTypes",
            name_column="externalSecondLevelPackageName",
            chart_name_prefix="Java_Most_spread_second_level_packages_by_types",
            primary_threshold_percent=0.5,
            report_directory=report_directory,
            verbose=verbose,
        )
        save_pie_chart_pair(
            source_data=second_level_spread_data,
            value_column="sumNumberOfPackages",
            name_column="externalSecondLevelPackageName",
            chart_name_prefix="Java_Most_spread_second_level_packages_by_packages",
            primary_threshold_percent=0.5,
            report_directory=report_directory,
            verbose=verbose,
        )
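
To put a number on the legibility concern: a percentage threshold alone only bounds the slice count at 100 divided by the threshold. The sketch below works that out for the two thresholds used above (0.7 and 0.5) and compares a thresholded pie with and without a top-20 cap. It does not reproduce group_small_values_into_others, whose implementation is not shown here. It assumes a slice stays visible when its share is at least the threshold and that everything else collapses into a single "others" slice. The helper names and the 100-package dataset are hypothetical.

```python
import math


def max_slices_above_threshold(threshold_percent: float) -> int:
    """Upper bound on how many slices can each hold >= threshold_percent."""
    return math.floor(100 / threshold_percent)


def count_visible_slices(values: list[float], threshold_percent: float) -> int:
    """Slices drawn after folding small values into one 'others' slice.

    Assumption: a value stays visible when its share is >= the threshold.
    """
    total = sum(values)
    kept = sum(1 for v in values if v * 100 / total >= threshold_percent)
    has_others = kept < len(values)
    return kept + (1 if has_others else 0)


# Worst-case slice counts permitted by the thresholds used in the snippet.
bound_for_0_7 = max_slices_above_threshold(0.7)
bound_for_0_5 = max_slices_above_threshold(0.5)

# Hypothetical project: 100 external packages with identical usage (1% each).
all_packages = [1.0] * 100
top_20_only = sorted(all_packages, reverse=True)[:20]

slices_without_cap = count_visible_slices(all_packages, 0.7)
slices_with_cap = count_visible_slices(top_20_only, 0.7)
```

With evenly spread usage, every one of the 100 packages clears the 0.7% threshold and gets its own slice and legend entry, whereas the former top-20 slicing would have drawn 20. Real dependency distributions are usually more skewed, so the practical count on a project like AxonFramework may be far lower. This is the reason the review asks for a visual check rather than a code change.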
346969a to 9347997
🚀 Feature
⚙️ Optimization
🛠 Fix
📖 Documentation